# High-resolution Image Processing
Unime LLaVA OneVision 7B
MIT
UniME is a general embedding learning framework built on multimodal large models. It significantly improves multimodal embedding capabilities through textual discriminative knowledge distillation and hard-negative-enhanced instruction tuning.
Multimodal Alignment
Transformers, English
DeepGlint-AI
376
2
Unime LLaVA 1.6 7B
MIT
UniME is a general embedding learning model based on a multimodal large model, trained with 336×336 image resolution and ranked first on the MMEB leaderboard.
Image-to-Text
Transformers, English
DeepGlint-AI
188
3
PE Core B16 224
Apache-2.0
The Perception Encoder is a state-of-the-art image and video understanding encoder trained through simple vision-language learning, achieving top performance across various visual tasks.
Text-to-Image
facebook
9,663
11
PE Core L14 336
Apache-2.0
A large-scale visual encoder developed by Meta that achieves state-of-the-art performance on various vision tasks through contrastive pre-training and fine-tuning on synthetic video data.
Text-to-Image
facebook
11.52k
34
Aimv2 3b Patch14 224.apple Pt
AIM-v2 is an efficient image encoder model compatible with the timm framework, suitable for computer vision tasks.
Image Classification
Transformers
timm
50
0